Presentation

Group 4

Introduction

  • Colorectal cancer (CRC) has the 3rd highest mortality rate among cancers

  • More effective methods to detect CRC are needed

  • Correlation between exosomes and tumorigenesis

  • miRNA and mRNA can serve as biomarkers: these are what we want to find!

GEO data

  • The data was loaded with the GEOquery library, so there was no need to download any files manually

  • Data was already standardized, though we still did some cleaning
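
A minimal sketch of loading a series with GEOquery, assuming the GEOquery and Biobase packages are installed. The helper name load_gse and the choice to take the first ExpressionSet are illustrative assumptions, and the accession is left as a parameter because the slides do not state it.

```r
# Hypothetical helper (not the group's code): loads one GEO series by accession.
load_gse <- function(accession) {
  # getGEO() returns a list of ExpressionSet objects when GSEMatrix = TRUE
  gse_list <- GEOquery::getGEO(GEO = accession, GSEMatrix = TRUE)
  gse <- gse_list[[1]]  # assumption: the series uses a single platform
  list(
    expression = Biobase::exprs(gse),  # features x samples matrix
    samples    = Biobase::pData(gse)   # per-sample annotation table
  )
}
```

Calling load_gse() with the series accession would fetch the data directly from GEO, which matches the "no files to download" point above.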

[Figures: miRNA from the GEO data; Frequency of the different states of cancer]

Data analysis - GSE

  • Log2 Transformation:
    • Data points below 0 were replaced with NaN before the transformation.
  • Design Matrix for Tumor vs Normal Comparison:
  • Filtering Expressed Genes Based on Median Expression:
    • Only genes with above-median expression in 1/3 of the samples were retained.
  • Linear Model Fitting and Contrasts:
    • A linear model was fitted.
    • Contrasts were created between the two groups (Tumor and Normal).
    • An empirical Bayes step was performed to obtain statistics and p-values.
  • Array Weights for Model Fitting:
    • Per-array weights were estimated and used when refitting the model.
  • Empirical Bayes Step (Repeated):
    • The empirical Bayes step was applied again to the weighted fit.
  • Tidying data
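
The steps above can be sketched as follows, on simulated data rather than the GEO series. The function names, the "at least one third" reading of the median filter, and the use of limma for the model fitting are assumptions; the limma part only runs if that package is installed.

```r
# Simulated intensities: 200 probes x 6 samples (not the real GEO data)
set.seed(42)
expr_raw <- matrix(runif(1200, min = 1, max = 1000), nrow = 200,
                   dimnames = list(paste0("mir_", 1:200), paste0("s", 1:6)))
expr_raw["mir_1", "s1"] <- -5  # one invalid negative value

# Log2 transformation: values below 0 become NaN first
log2_with_nan <- function(mat) {
  mat[which(mat < 0)] <- NaN
  log2(mat)
}
expr_log <- log2_with_nan(expr_raw)

# Keep genes above the overall median in at least min_fraction of the samples
# (assumption: "in 1/3 of the samples" means "in at least 1/3")
filter_by_median <- function(mat, min_fraction = 1 / 3) {
  cutoff <- median(mat, na.rm = TRUE)
  n_above <- rowSums(mat > cutoff, na.rm = TRUE)
  mat[n_above >= min_fraction * ncol(mat), , drop = FALSE]
}
expr_filtered <- filter_by_median(expr_log)

if (requireNamespace("limma", quietly = TRUE)) {
  # Design matrix and Tumor vs Normal contrast
  group <- factor(rep(c("Normal", "Tumor"), each = 3))
  design <- model.matrix(~ 0 + group)
  colnames(design) <- levels(group)
  contrast <- limma::makeContrasts(Tumor - Normal, levels = design)

  # Linear model, contrast, empirical Bayes
  fit <- limma::lmFit(expr_filtered, design)
  fit <- limma::eBayes(limma::contrasts.fit(fit, contrast))

  # Array weights, then refit and repeat the empirical Bayes step
  aw <- limma::arrayWeights(expr_filtered, design)
  fit_w <- limma::lmFit(expr_filtered, design, weights = aw)
  fit_w <- limma::eBayes(limma::contrasts.fit(fit_w, contrast))
  results <- limma::topTable(fit_w, coef = 1, number = Inf)
}
```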

Data retrieval - TCGA

  • Fetch analyte.tsv & clinical.tsv from raw_/

    • Obtain IDs of the patients for whom the RNA expression was registered
  • Library TCGAbiolinks is used to retrieve data from the GDC data portal

    • Function: retrieve_and_prepare_data()

      • GDCquery: builds the query that specifies which data to get

      • GDCdownload: downloads the samples matched by the query

  • Example:

miRNA_data_cancer <- retrieve_and_prepare_data(
  project = "TCGA-COAD",
  data_category = "Transcriptome Profiling",
  data_type = "miRNA Expression Quantification",
  workflow_type = "BCGSC miRNA Profiling",
  id_cancer_patients = id_cancer_patients_cancer,
  directory_prefix = "samples_miRNA"
)

  • miRNA data - 2 separate dataframes

  • mRNA data - Large SummarizedExperiment
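
A possible shape for retrieve_and_prepare_data(), assuming TCGAbiolinks is installed. The argument names come from the example call above; the body is a guess at how the helper could wrap GDCquery and GDCdownload, and the final GDCprepare step is an assumption not stated on the slide.

```r
# Sketch only: the group's actual implementation is not shown in the slides.
retrieve_and_prepare_data <- function(project, data_category, data_type,
                                      workflow_type, id_cancer_patients,
                                      directory_prefix) {
  # Build the query restricted to the selected patient barcodes
  query <- TCGAbiolinks::GDCquery(
    project       = project,
    data.category = data_category,
    data.type     = data_type,
    workflow.type = workflow_type,
    barcode       = id_cancer_patients
  )
  # Download the matching files into the given directory
  TCGAbiolinks::GDCdownload(query, directory = directory_prefix)
  # Assumption: the downloaded files are then read into an R object
  TCGAbiolinks::GDCprepare(query, directory = directory_prefix)
}
```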

Data description - TCGA

  • Data tidying:
    • “Like families, tidy datasets are all alike, but every messy dataset is messy in its own way.” (Hadley Wickham)
  • Data visualizations:
    • Number of cancer vs. normal samples; Gender distribution; Pathological cancer stages

[Figures: Distribution of Age by Gender; Frequency of AJCC Pathologic Stages by Status]

Data preprocess - TCGA

  • Creating metadata that links each patient ID to its tissue status (TCGA # - Tissue Status) for mRNAs and miRNAs.
  • Log2 transformation of the two datasets and reorganization so that both follow the same tidy layout: rows = gene names, columns = patient IDs, values = expression data.
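
A small sketch of both steps on made-up barcodes and counts. It relies on the TCGA barcode convention that characters 14-15 hold the sample type code (01-09 tumor, 10-19 normal); the function name tissue_status and the pseudocount of 1 in the log2 step are assumptions.

```r
# Made-up barcodes for illustration (not real TCGA samples)
barcodes <- c("TCGA-AA-0001-01A-01R-0001-07",
              "TCGA-AA-0002-11A-01R-0002-07")

# Derive tissue status from the sample type code in the barcode
tissue_status <- function(barcodes) {
  code <- as.integer(substr(barcodes, 14, 15))
  ifelse(code < 10, "Tumor", ifelse(code < 20, "Normal", "Control"))
}

metadata <- data.frame(
  patient_id = substr(barcodes, 1, 12),
  barcode    = barcodes,
  status     = tissue_status(barcodes),
  stringsAsFactors = FALSE
)

# Tidy layout: rows = gene names, columns = patient IDs, values = expression
counts <- matrix(c(0, 7, 15, 1023), nrow = 2,
                 dimnames = list(c("gene_a", "gene_b"), barcodes))

# Assumption: a pseudocount of 1 avoids log2(0)
expr_log <- log2(counts + 1)
```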

[Figures: Log2 transformed data, miRNA; Log2 transformed data, mRNA]

Data augmentation - TCGA, edgeR

  • Calculation of the normalization factors for the data (_log) with calcNormFactors and imputation of NAs using means.

  • Running a “universal” edgeR differential analysis function with a quasi-likelihood model.
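
The two steps can be sketched as below. Imputing with per-gene (row) means is an assumption, since the slide only says "means". The edgeR part runs on simulated raw counts, which is also an assumption (the slide applies the step to the _log data), and only if edgeR is installed.

```r
# Replace each NA with the mean of the non-missing values in the same row
impute_row_means <- function(mat) {
  row_means <- rowMeans(mat, na.rm = TRUE)
  missing <- which(is.na(mat), arr.ind = TRUE)
  mat[missing] <- row_means[missing[, "row"]]
  mat
}

toy <- matrix(c(1, NA, 3,
                4, 5, NA), nrow = 2, byrow = TRUE)
toy_imputed <- impute_row_means(toy)

if (requireNamespace("edgeR", quietly = TRUE)) {
  # Simulated counts: 100 genes x 6 samples (not the TCGA data)
  set.seed(1)
  counts <- matrix(rnbinom(600, mu = 50, size = 5), nrow = 100,
                   dimnames = list(paste0("gene_", 1:100), paste0("s", 1:6)))
  group <- factor(rep(c("Normal", "Tumor"), each = 3))
  design <- model.matrix(~ group)

  dge <- edgeR::DGEList(counts = counts, group = group)
  dge <- edgeR::calcNormFactors(dge)         # normalization factors
  dge <- edgeR::estimateDisp(dge, design)    # dispersion estimates
  fit <- edgeR::glmQLFit(dge, design)        # quasi-likelihood fit
  qlf <- edgeR::glmQLFTest(fit, coef = 2)    # Tumor vs Normal
  stats_table <- edgeR::topTags(qlf, n = Inf)$table
}
```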

[Tables: Statistics table; Final _aug dataset]

Results (All)

  • Volcano Plot:
    • Additional columns were created for volcano plotting:
      • ‘diffexpressed’ with values: NO, UP, and DOWN.
      • ‘label’ with the GENE_IDs of overexpressed genes.
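
A minimal sketch of how the two columns could be built, on a made-up statistics table. The cutoffs (|logFC| > 1, adjusted p < 0.05), the column names logFC and adj.P.Val, and the function name annotate_volcano are assumptions; the slides do not state the thresholds used.

```r
# Add 'diffexpressed' (NO / UP / DOWN) and 'label' columns for plotting
annotate_volcano <- function(stats, lfc_cutoff = 1, p_cutoff = 0.05) {
  significant <- stats$adj.P.Val < p_cutoff
  stats$diffexpressed <- "NO"
  stats$diffexpressed[stats$logFC > lfc_cutoff & significant] <- "UP"
  stats$diffexpressed[stats$logFC < -lfc_cutoff & significant] <- "DOWN"
  # Label only the overexpressed (UP) genes, as described above
  stats$label <- ifelse(stats$diffexpressed == "UP",
                        stats$GENE_ID, NA_character_)
  stats
}

# Made-up values for illustration
example_stats <- data.frame(
  GENE_ID   = c("gene_a", "gene_b", "gene_c", "gene_d"),
  logFC     = c(2.5, -1.8, 0.3, 3.0),
  adj.P.Val = c(0.001, 0.01, 0.2, 0.4),
  stringsAsFactors = FALSE
)
volcano_table <- annotate_volcano(example_stats)
```

The resulting table can be passed to a plotting library with logFC on the x-axis and -log10 of the adjusted p-value on the y-axis, coloured by diffexpressed.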

[Figures: TCGA mRNA; TCGA miRNA; GSE miRNA]

Results

  • Heatmaps:

[Figures: TCGA mRNA; TCGA miRNA; GSE miRNA]

Conclusions

  • Although we were able to follow the article’s instructions, our results differ significantly from the authors’. The differences may stem from additional steps taken during data preprocessing, or from the sparse information the authors provide. It would be wise to contact the authors and ask about their preprocessing and data retrieval.

  • Overall, we believe our analysis was carried out correctly, and the results did not indicate any serious errors.

  • In addition, the data we used has a different number of samples in each stage, and the stages themselves differ between the TCGA and GSE datasets.